Processor Shadowing: Maximizing Expected Throughput in Fault-Tolerant Systems
Authors
Abstract
This paper studies parallel processing as a device for increasing fault tolerance. In the first of two basic models, a single job with a given running time is to be run on a finite set of processors; each processor is subject to failure but only while running a job. If a job is running on only one processor, and that processor fails, then the job must be restarted on another processor, assuming not all processors have already failed. To avoid such losses in accrued running time when at least two processors are available, it can be decided at any time to run the job synchronously on two processors in parallel, a replication technique we call shadowing. Clearly, shadowing has its own downside: while two processors are running, the failure rate is doubled. We show how to resolve this trade-off optimally; we devise a policy that schedules shadowing in such a way as to maximize the probability that the job finishes before all processors fail. We prove that the policy is of threshold type. That is, depending on the number of processors and the duration of the job, there is an optimal time to begin shadowing; once started, shadowing continues so long as neither processor fails and the job does not complete. We also show that the thresholds are monotone in the number of processors, i.e., if more processors are initially available, then shadowing should be started sooner. In the second of our two models, we have the same set-up except that we have an unbounded number of jobs, each having the same running time, and the objective is to maximize the expected number of jobs completed before all processors fail. We show that the optimal policy is again of threshold type, but that the thresholds are, surprisingly, not monotone in the number of processors. The optimal thresholds have a curious oscillatory behavior that we study in detail. Variants of the above problems are also analyzed using the same methods; several other variants are left as interesting open problems.
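The threshold policy of the first model can be explored numerically. The following is a minimal Monte Carlo sketch, not the paper's method: it assumes exponentially distributed failure times with a hypothetical rate `lam` per running processor, and it uses a single hypothetical `threshold` for every processor count, whereas the abstract states that the optimal threshold depends on the number of processors. The function names are illustrative only.

```python
import random


def simulate_job(n, T, lam, threshold, rng):
    """Return True if one job of length T completes before all n processors fail.

    Assumptions (not taken from the paper): failures are exponential with
    rate lam per running processor, and shadowing starts once the accrued
    running time reaches `threshold`, whatever the number of processors left.
    """
    alive = n
    progress = 0.0  # accrued running time of the current attempt
    while alive > 0:
        if alive >= 2 and progress >= threshold:
            # Shadowing: two processors run, so the first failure has rate 2*lam.
            t_fail = rng.expovariate(2 * lam)
            if progress + t_fail >= T:
                return True
            progress += t_fail  # the surviving processor keeps the progress
            alive -= 1
        else:
            # One processor runs until it fails, the job ends, or the
            # shadowing threshold is reached (only relevant with a spare).
            horizon = T if alive < 2 else min(T, threshold)
            t_fail = rng.expovariate(lam)
            if progress + t_fail >= horizon:
                if horizon >= T:
                    return True
                # Threshold reached; exponential lifetimes are memoryless,
                # so the next iteration may resample the failure time.
                progress = horizon
            else:
                alive -= 1
                progress = 0.0  # unshadowed failure: restart from scratch
    return False


def completion_probability(n, T, lam, threshold, trials=20000, seed=0):
    """Estimate the probability that the job finishes before all processors fail."""
    rng = random.Random(seed)
    wins = sum(simulate_job(n, T, lam, threshold, rng) for _ in range(trials))
    return wins / trials
```

Calling `completion_probability` over a grid of `threshold` values for a fixed `n`, `T`, and `lam` gives a rough empirical comparison of candidate thresholds under these assumptions.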
Similar resources
Real-time Fault-tolerant Scheduling Algorithm for Distributed Computing Systems
This article proposes a distributed real-time fault-tolerant model, a priority real-time fault-tolerant algorithm, and a distributed real-time fault-tolerant computational architecture. According to this model, the problem of how to schedule a weighted Directed Acyclic Graph (DAG) in a distributed computing system for high reliability can be solved in the presence of multiprocessor faults. When som...
A Formal Description of FTAG for Multi-Processor Systems
FTAG is a functional model for writing fault-tolerant software that is based on attribute grammars. With this approach, a program is written as a series of module decompositions, with provisions for redoing and replicating modules used to implement fault-tolerance requirements. The functional nature of the model and the independence of decompositions make FTAG especially well-suited for implemen...
A fault tolerant NoC architecture using quad-spare mesh topology and dynamic reconfiguration
Network-on-Chip (NoC) is widely used as a communication scheme in modern many-core systems. To guarantee the reliability of communication, effective fault tolerant techniques are critical for an NoC. In this paper, a novel fault tolerant architecture employing redundant routers is proposed to maintain the functionality of a network in the presence of failures. This architecture consists of a me...
Voting Algorithm Based on Adaptive Neuro Fuzzy Inference System for Fault Tolerant Systems
Some applications are critical and must be designed as fault-tolerant systems. A voting algorithm is usually one of the principal elements of a fault-tolerant system. Two kinds of voting algorithm are used in most applications: the majority voting algorithm and the weighted average algorithm. These algorithms have some problems. Majority voting faces the problem of threshold limits and voter of weight...
Ditto Processor
Concentration of design effort for current single-chip Commercial-Off-The-Shelf (COTS) microprocessors has been directed towards performance. Reliability has not been the primary focus. As supply voltage scales to accommodate technology scaling and to lower power consumption, transient errors are more likely to be introduced. The basic idea behind any error tolerance scheme involves some type o...
Journal title:
- Math. Oper. Res.
Volume 24, Issue
Pages -
Publication date 1999